feat: clarify the expected behavior, and rationale, of the post join filter #807

westonpace · 2025-04-23T12:43:26Z

The post join filter has very little explanation. It can also be confusing because, from a purely logical perspective, it is possible to see the post join filter as redundant. This PR attempts to clarify the description of the post join filter.

yongchul · 2025-04-23T13:49:23Z

site/docs/relations/logical_relations.md

-| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional                           |
+| Post-Join Filter | A boolean condition to be applied to each potential match between the left and right
+inputs.  If it evaluates to false then the potential match is not considered a match.  A join relation with
+Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional                           |


Thank you! Much better than my local draft! :)

Two more things.

Align the Hash/MergeJoin post-join filter description with this. We could refer to JoinRel there and describe only what differs.

Should this be Optional, default True like hash/merge join?

> Align Hash/MergeJoin post-join filter description with this. We could refer JoinRel there and leave what's different.

Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the statement "A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y" because this is not true for hash/merge join (the join expression for these relations is a series of equality conditions).

I meant the language of the description. The way you describe it is more explicit that the post_join_filter IS part of the join condition, by saying what "matches" and what "does not match". It is not there to reduce the output.

Also, can we drop Equi from HashEquiJoin? :)
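To illustrate why the filter stays separate for hash/merge joins, here is a minimal Python sketch of a left hash join in which the keys are matched by equality and the post-join filter is evaluated on each candidate pair of input records. The function name `hash_left_join`, the dict-based rows, and the sample data are all hypothetical and not part of Substrait; SQL NULL key semantics are ignored for brevity.

```python
from collections import defaultdict


def hash_left_join(left, right, left_key, right_key, post_join_filter=None):
    """Left hash join: equality on keys, then a residual filter per candidate pair."""
    # Build phase: index the right input by its join key.
    table = defaultdict(list)
    for r in right:
        table[r[right_key]].append(r)

    out = []
    for l in left:
        matched = False
        # Probe phase: only key-equal rows are candidates.
        for r in table.get(l[left_key], []):
            # The filter sees both input records and decides whether the
            # candidate counts as a match at all.
            if post_join_filter is None or post_join_filter(l, r):
                matched = True
                out.append((l, r))
        if not matched:
            # No surviving candidate: the left row is null-padded, not dropped.
            out.append((l, None))
    return out


# Hypothetical sample data: the non-equality condition cannot be a hash key.
orders = [{"id": 1, "day": 5}, {"id": 2, "day": 9}]
prices = [
    {"id": 1, "from_day": 1},
    {"id": 1, "from_day": 7},
    {"id": 2, "from_day": 10},
]

rows = hash_left_join(
    orders, prices, "id", "id",
    post_join_filter=lambda l, r: r["from_day"] <= l["day"],
)
```

In this sketch the second order comes back null-padded rather than dropped, because its only key-equal candidate fails the filter and so never counts as a match.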

drin · 2025-04-23T17:35:16Z

> A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y

Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, I think it'd be nice to add a comment in the .proto file to the effect of "post_join_filter should be resolved in conjunction (AND) with expression."

drin · 2025-04-23T18:00:02Z

proto/substrait/algebra.proto

+  // The post-join filter is a filter that is applied to the result of the join before an output
+  // record is produced.  If the filter evaluates to false then the record is not considered a
+  // match.


I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:

1. the post_join_filter is a comparison (lte) that uses a window function (count)

2. the expression is a predicate with selectivity between 0 and 100%

3. the expression produces many tuples from one input for a single tuple of the other input

(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.

So, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood, because the "result" of the join sounds like the result of applying expression. To be accurate to "equivalent to a join relation with Join Expression X AND Y", I think you must evaluate post_join_filter on the inputs to expression, even if you only evaluate its truthiness on joined records for which expression evaluates to true.

Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?

Agree that it's better to clarify that the predicates are evaluated over the inputs, as in @drin's suggestion.

+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure what the intended behavior is, tbh. Let's say you're joining two tables a LEFT JOIN b ON ... with a post-join filter of a.Col1 = b.Col2. Is the a.Col1 = b.Col2 expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?

I get very confused easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.

    for left_record in left_records:
        has_match = False
        for right_record in right_records:
            if join_expression(left_record, right_record) and post_join_filter(left_record, right_record):
                has_match = True
                if join_type == Inner or join_type == Left:
                    emit(combine(left_record, right_record))
        if has_match and join_type == LeftSemi:
            emit(left_record)
        elif not has_match and join_type == LeftAnti:
            emit(left_record)
        elif not has_match and join_type == Left:
            emit(combine(left_record, null))

If someone has an alternate proposal, is it possible to share your own pseudocode representation?
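For anyone who wants to run it, the pseudocode above can be turned into a small Python sketch. The `JoinType` enum, the `join` helper, and the sample rows are illustrative assumptions; emitted rows are collected in a list instead of calling `emit`.

```python
from enum import Enum, auto


class JoinType(Enum):
    INNER = auto()
    LEFT = auto()
    LEFT_SEMI = auto()
    LEFT_ANTI = auto()


def join(left_records, right_records, join_type, join_expression, post_join_filter):
    """Nested-loop join where a pair matches only if both predicates hold."""
    out = []
    for left in left_records:
        has_match = False
        for right in right_records:
            # Both predicates are evaluated on the two input records.
            if join_expression(left, right) and post_join_filter(left, right):
                has_match = True
                if join_type in (JoinType.INNER, JoinType.LEFT):
                    out.append((left, right))
        if has_match and join_type is JoinType.LEFT_SEMI:
            out.append((left,))
        elif not has_match and join_type is JoinType.LEFT_ANTI:
            out.append((left,))
        elif not has_match and join_type is JoinType.LEFT:
            out.append((left, None))  # null-padded left row
    return out


# Hypothetical inputs: left rows are ids, right rows are (id, other_field).
a = [1, 2, 3]
b = [(1, "x"), (2, None)]
same_id = lambda l, r: l == r[0]
not_null = lambda l, r: r[1] is not None

results = {jt: join(a, b, jt, same_id, not_null) for jt in JoinType}
```

With these inputs, id 2 has a key-equal row on the right that fails the filter, so under this reading it is treated exactly like id 3, which has no key-equal row at all.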

@tokoko the predicates in the join condition do not follow the join type. The join type comes into play depending on whether there is a matching row (i.e., intersection) or not. To behave correctly, outer joins and antisemi joins must ensure that no row matches according to JoinRel.expression AND JoinRel.post_join_filter before they produce null-padded rows (outer) or include the row in the output (antisemi).

makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.

@tokoko that's why I initially proposed to drop post_join_filter from JoinRel in the slack discussion. :)

I agree with that pseudocode, Weston. My intention is to make the wording clearly reflect that post_join_filter(left_record, right_record) is valid and post_join_filter(combine(left_record, right_record)) is invalid.

Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:

    # above pseudocode here
    ...
    if post_join_filter(emitted_record):
        really_emit(emitted_record)

Yup, 'post_join' really tripped me up and is the reason I started the thread. The systems I have worked on used "residual join condition/predicate" rather than "post".

jacques-n · 2025-05-04T20:53:03Z

It's difficult to follow the threads in this discussion.

One can think of a join with a post join filter as a composite operation that is a join followed by a filter relation. It is an entirely valid translation to take a post join filter out of the join and put it in a filter relation directly afterwards, and vice versa.

The post join filter does not logically interact with the join type at all. The composite exists because many systems have it and it can be a beneficial physical pattern. The reason it has to be stated separately from the join predicate is to have covering behavior of all possible filter conditions. I always have to remind myself of which conditions can and cannot be moved into a join evaluation clause.

I'm supportive of clarifying the text if people are unclear as to what post join filter means.

westonpace · 2025-05-05T17:14:28Z

> One can think of a join with a post join filter as a composite operation that is a join followed by a filter relation. It is entirely valid translation to take a post join filter out of the join and put in a filter relation directly afterwards and vice versa.

@jacques-n

This is not the conclusion we came to. I believe the content of the PR is still accurate with the threads, so you can just review the content and ignore the discussion.

For example:

SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL

As it stands, this query emits at least one row for each row in a. If the filter is moved into the WHERE clause, it can emit fewer rows (at most as many as an inner join would emit).

From your GPT link this matches:

> Conditions on the non-preserved side that would otherwise eliminate rows that should remain when there's no match
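The difference can be checked with SQLite from Python. This is only a sketch: the table contents are made up, and `in_on` / `in_where` are hypothetical names for the two result sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER, other_field TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 'x'), (2, NULL);
    """
)

# Condition kept in the ON clause: it only decides what counts as a match.
in_on = conn.execute(
    """
    SELECT a.id, b.other_field
    FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL
    ORDER BY a.id
    """
).fetchall()

# Condition moved to WHERE: it runs after the join and drops padded rows.
in_where = conn.execute(
    """
    SELECT a.id, b.other_field
    FROM a LEFT OUTER JOIN b ON a.id = b.id
    WHERE b.other_field IS NOT NULL
    ORDER BY a.id
    """
).fetchall()

conn.close()
```

With this data the ON version keeps all three rows of a (two of them null-padded), while the WHERE version keeps only the row for id 1.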

drin · 2025-05-05T18:23:49Z

There is a thread that discusses the description that would go in the website: discussion on website description

I think further discussion on this can be deferred until agreement on post_join_filter semantics is finalized.

Then, there's a thread discussing the comment in the .proto file for post_join_filter: discussion on comment in spec

This discussion assumes that Weston's assertion in the description is the correct semantics of post_join_filter. As Weston points out, that description is in contradiction to Jacques's comment.

> This is not the conclusion we came to

In this PR, we never collectively discussed what it should be versus what it is. The description says:

> The post join filter has very little explanation... from a purely logical perspective, it is possible to see the post join filter as redundant.

And I asked:

> Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, ...

One thing that was referenced in slack is the substrait FAQ: "The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join." This FAQ then references velox hash-join implementation, which says: "Filter is optional. If specified it can be any expression over the results of the join."

It occurs to me that the FAQ says "post-join filter... is not always equivalent to an explicit Filter relation AFTER the join," yet the referenced velox documentation says "If specified, it can be any expression over the results of the join." These seem directly contradictory to me, since "the results of the join" sounds quite a bit like "AFTER the join".

Clarify the expected behavior, and rationale, of the post join filter

3ad21f6

westonpace requested review from EpsilonPrime, cpcloud, jacques-n and vbarua as code owners April 23, 2025 12:43

yongchul approved these changes Apr 23, 2025

drin reviewed Apr 23, 2025

EpsilonPrime approved these changes May 2, 2025

6 participants
